Computational modeling of turn-taking dynamics in spoken conversations
The study of human interaction dynamics has been central to multiple research disciplines, including computer and social sciences, conversation analysis, and psychology, for decades. Recent interest has focused on designing computational models to improve human-machine interaction systems as well as to support humans in their decision-making processes. Turn-taking is one of the key aspects of conversational dynamics in dyadic conversations and is an integral part of human-human and human-machine interaction systems. It is used for the discourse organization of a conversation by means of explicit phrasing, intonation, and pausing, and it involves intricate timing. In verbal (e.g., telephone) conversation, turn transitions are facilitated by inter- and intra-speaker silences and overlaps. Early turn-taking research in the speech community studied the durational aspects of turns, cues for turn-yielding intention, and turn-transition modeling for spoken dialog agents. Compared to the study of turn transitions, very little work has been done on classifying overlap discourse, especially competitive overlaps and the function of silences.
Given the limitations of the current state of the art, this dissertation focuses on two aspects of conversational dynamics: 1) designing automated computational models for analyzing turn-taking behavior in a dyadic conversation, and 2) predicting the outcome of a conversation, i.e., observed user satisfaction, using turn-taking descriptors. These two aspects are then combined to design a conversational profile for each speaker based on turn-taking behavior and conversation outcomes. The analysis, experiments, and evaluation have been conducted on a large dataset of Italian call-center spoken conversations in which customers and agents are engaged in real problem-solving tasks.
Towards this research goal, the challenges include automatically segmenting and aligning speakers' channels from the speech signal, and identifying and labeling turn types and their functional aspects. The task becomes more challenging due to the presence of overlapping speech: to model turn-taking behavior, the intention behind these overlapping turns needs to be considered. The most critical question, however, is how to model observed user satisfaction in a dyadic conversation and which properties of turn-taking behavior can be used to represent and predict the outcome.
Thus, the computational models for analyzing turn-taking dynamics in this dissertation include automatic segmentation and labeling of turn types, categorization of competitive vs. non-competitive overlaps, analysis of silences (e.g., lapses, pauses), and characterization of turn functions in terms of dialog acts.
The novel contributions of the work presented here are:
1. the design of a fully automated turn segmentation and labeling system (e.g., agent vs. customer turns, within-speaker lapses, and overlaps);
2. the design of annotation guidelines for segmenting and annotating speech overlaps with competitive and non-competitive labels;
3. a demonstration of how different channels of information, such as acoustic, linguistic, and psycholinguistic feature sets, perform in the classification of competitive vs. non-competitive overlaps (a fusion sketch follows this list);
4. a study of the role of speakers and context (i.e., agents' and customers' speech) in conveying competitiveness, for each individual feature set and their combinations;
5. an investigation of the function of long silences in the information flow of a dyadic conversation.
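As a rough illustration of contribution 3, the sketch below fuses three feature sets by concatenation and trains a single overlap classifier. The feature values, dimensions, and classifier choice (an SVM via scikit-learn) are placeholder assumptions, not the dissertation's actual pipeline.

```python
# Minimal sketch of feature-set fusion for competitive vs. non-competitive
# overlap classification. Feature values are synthetic placeholders; the
# dissertation's actual acoustic/linguistic/psycholinguistic extractors are
# not reproduced here.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200  # number of overlap segments

# One row per overlap segment: acoustic (e.g., pitch/energy statistics),
# linguistic (e.g., lexical n-gram scores), and psycholinguistic (e.g.,
# LIWC-style category counts) features. Dimensions here are arbitrary.
acoustic = rng.normal(size=(n, 12))
linguistic = rng.normal(size=(n, 8))
psycholinguistic = rng.normal(size=(n, 6))
y = rng.integers(0, 2, size=n)  # 1 = competitive, 0 = non-competitive

# Early fusion: concatenate the feature sets and train one classifier.
X = np.hstack([acoustic, linguistic, psycholinguistic])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
print(cross_val_score(clf, X, y, cv=5).mean())
```

Dropping one of the three blocks from the concatenation gives a simple way to compare individual feature sets against their combination.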
The extracted turn-taking cues are then used to automatically predict the outcome of the conversation, which is modeled from continuous manifestations of emotion. The contributions include:
1. modeling the state of observed user satisfaction in terms of the final emotional manifestation of the customer (i.e., the user);
2. analyzing and modeling turn-taking properties to show how each turn type influences user satisfaction;
3. studying how turn-taking behavior changes within each emotional state.
Based on the studies conducted in this work, it is demonstrated that turn-taking behavior, especially the competitiveness of overlaps, is more than just an organizational tool in daily human interactions. It carries useful information and can predict the outcome of a conversation in terms of satisfaction vs. dissatisfaction. Combining turn-taking behavior and conversation outcomes, the final goal is to design a conversational profile for each speaker. Such profile information would not only assist domain experts but could also be useful to call-center agents in real time.
These systems are fully automated, and no human intervention is required. The findings are potentially relevant to research on overlapping speech and to the automatic analysis of human-human and human-machine interactions.
What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis
End-to-end DNN architectures have pushed the state of the art in speech technologies, as well as in other spheres of AI, leading researchers to train more complex and deeper models. These improvements have come at the cost of transparency: DNNs are innately opaque and difficult to interpret. We no longer understand what features are learned, where they are preserved, and how they inter-operate. Such an analysis is important for better model understanding and debugging, and for ensuring fairness in ethical decision-making. In this work, we analyze the representations trained within deep speech models for the tasks of speaker recognition, dialect identification, and reconstruction of masked signals. We carry out a layer- and neuron-level analysis of the utterance-level representations captured within pretrained speech models for speaker, language, and channel properties. We study: is this information captured in the learned representations? Where is it preserved? How is it distributed? And can we identify a minimal subset of the network that possesses this information? Using diagnostic classifiers, we answer these questions. Our results reveal that: (i) channel and gender information is omnipresent and redundantly distributed; (ii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network and are localised in the upper layers; (iii) a minimal subset of neurons can be extracted to encode a predefined property; (iv) salient neurons are sometimes shared between properties and can highlight the presence of biases in the network. Our cross-architectural comparison indicates that (v) the pretrained models capture speaker-invariant information, and (vi) pretrained CNN models are competitive with Transformers for encoding the studied properties. To the best of our knowledge, this is the first study to investigate neuron-level analysis of speech models.
Comment: Submitted to CSL. Keywords: Speech, Neuron Analysis, Interpretability, Diagnostic Classifier, AI Explainability, End-to-End Architecture
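To make the methodology concrete, here is a minimal sketch of a diagnostic (probing) classifier of the kind the abstract describes: a simple linear model trained on frozen layer activations to test whether a property is decodable, with an L1 penalty used to surface a small set of salient neurons. The representations and labels are random placeholders, and the layer size and probe settings are assumptions, not the paper's exact configuration.

```python
# Sketch of a diagnostic (probing) classifier: train a linear model on frozen
# layer activations to test whether a property (e.g., gender or dialect) is
# linearly decodable from that layer. Activations here are random stand-ins
# for utterance-level representations from a pretrained speech model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_utts, n_neurons = 500, 768
layer_repr = rng.normal(size=(n_utts, n_neurons))  # one vector per utterance
labels = rng.integers(0, 2, size=n_utts)           # property being probed

X_tr, X_te, y_tr, y_te = train_test_split(layer_repr, labels, random_state=0)
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
probe.fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))

# Neuron-level analysis: rank neurons by the magnitude of their learned
# weights; the top-ranked subset approximates a minimal set of neurons
# encoding the property (the L1 penalty pushes unneeded weights to zero).
ranking = np.argsort(-np.abs(probe.coef_[0]))
print("top 10 salient neurons:", ranking[:10])
```

A linear probe is deliberately weak: if it succeeds, the property must already be linearly encoded in the representation rather than extracted by the probe itself.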
The complementary roles of non-verbal cues for Robust Pronunciation Assessment
Research on pronunciation assessment systems focuses on utilizing the phonetic and phonological aspects of non-native (L2) speech, often neglecting the rich layer of information hidden within non-verbal cues. In this study, we propose a novel pronunciation assessment framework, IntraVerbalPA. The framework innovatively incorporates both fine-grained frame- and abstract utterance-level non-verbal cues, alongside the conventional speech and phoneme representations. Additionally, we introduce a "Goodness of phonemic-duration" metric to effectively model duration distribution within the framework. Our results validate the effectiveness of the proposed IntraVerbalPA framework and its individual components, yielding performance that either matches or outperforms existing works.
Comment: 5 pages, submitted to ICASSP 202
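The abstract does not define the "Goodness of phonemic-duration" metric, so the following is only one plausible reading, by analogy with Goodness of Pronunciation: score an observed phoneme duration by its log-likelihood under a reference duration distribution. All statistics and names below (ref_stats, goodness_of_duration) are hypothetical.

```python
# Hedged sketch of a "Goodness of phonemic-duration"-style score. The paper's
# exact definition is not given in this abstract; this version scores an
# observed phoneme duration by its log-likelihood under a Gaussian fitted to
# reference durations of the same phoneme. All numbers are illustrative.
import math

# Reference duration statistics (mean, std, in seconds) per phoneme, e.g.,
# fitted on native (L1) speech; the values here are made up.
ref_stats = {"AA": (0.10, 0.03), "T": (0.05, 0.015)}

def goodness_of_duration(phoneme: str, duration: float) -> float:
    """Log-likelihood of the observed duration under the reference Gaussian."""
    mu, sigma = ref_stats[phoneme]
    z = (duration - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

print(goodness_of_duration("AA", 0.11))  # near the reference mean: high score
print(goodness_of_duration("AA", 0.25))  # overly long vowel: low score
```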
Automatic Pronunciation Assessment -- A Review
Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth of language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends, and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work.
Comment: 9 pages, accepted to EMNLP Findings
Multi-View Multi-Task Representation Learning for Mispronunciation Detection
The disparity in phonology between a learner's native (L1) and target (L2) languages poses a significant challenge for mispronunciation detection and diagnosis (MDD) systems. This challenge is further intensified by the lack of annotated L2 data. This paper proposes a novel MDD architecture that exploits multiple `views' of the same input data, assisted by auxiliary tasks, to learn more distinctive phonetic representations in a low-resource setting. Using mono- and multilingual encoders, the model learns multiple views of the input and captures the sound properties across diverse languages and accents. These encoded representations are further enriched by learning articulatory features in a multi-task setup. Our results on the L2-ARCTIC data outperform the SOTA models, with phoneme error rate reductions of 11.13% and 8.60%, and absolute F1-score increases of 5.89% and 2.49%, compared to the single-view mono- and multilingual systems respectively, with a limited L2 dataset.
Comment: 5 pages
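As a hedged sketch of the multi-view, multi-task idea (not the paper's exact architecture), the code below fuses the outputs of two encoders and attaches a phoneme head for the main MDD task plus an articulatory-feature head as the auxiliary task. The encoder dimensions, head sizes, and loss weighting are illustrative assumptions.

```python
# Illustrative sketch: two encoder "views" (mono- and multilingual) feed a
# shared projection, with a phoneme-recognition head (main MDD task) and an
# articulatory-feature head (auxiliary task). Dimensions are stand-ins.
import torch
import torch.nn as nn

class MultiViewMDD(nn.Module):
    def __init__(self, mono_dim=768, multi_dim=1024, n_phones=42, n_artic=24):
        super().__init__()
        self.proj = nn.Linear(mono_dim + multi_dim, 512)
        self.phone_head = nn.Linear(512, n_phones)  # main task targets
        self.artic_head = nn.Linear(512, n_artic)   # auxiliary task targets

    def forward(self, mono_view, multi_view):
        # mono_view: (batch, time, mono_dim) from a monolingual encoder
        # multi_view: (batch, time, multi_dim) from a multilingual encoder
        fused = torch.cat([mono_view, multi_view], dim=-1)
        h = torch.relu(self.proj(fused))
        return self.phone_head(h), self.artic_head(h)

model = MultiViewMDD()
mono = torch.randn(2, 100, 768)   # placeholder encoder outputs
multi = torch.randn(2, 100, 1024)
phone_logits, artic_logits = model(mono, multi)
print(phone_logits.shape, artic_logits.shape)
# Training would combine both objectives, e.g.
# loss = phone_loss + lambda_aux * artic_loss.
```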
SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation
One of the biggest challenges in designing mispronunciation detection models is the unavailability of labeled L2 speech data. To overcome such data scarcity, we introduce SpeechBlender -- a fine-grained data augmentation pipeline for generating mispronunciation errors. SpeechBlender utilizes a variety of masks to target different regions of a phonetic unit, and uses mixing factors to linearly interpolate raw speech signals while generating erroneous pronunciation instances. The masks facilitate smooth blending of the signals, generating more effective samples than the `Cut/Paste' method. We show the effectiveness of our augmentation technique in a phoneme-level pronunciation quality assessment task, leveraging only a good-pronunciation dataset. With SpeechBlender augmentation, we observed 3% and 2% increases in Pearson correlation coefficient (PCC) compared to the no-augmentation and goodness-of-pronunciation augmentation scenarios, respectively, on the Speechocean762 test set. Moreover, a 2% rise in PCC is observed when comparing our single-task phoneme-level mispronunciation detection model with a multi-task learning model using multiple-granularity information.
Comment: 5 pages, submitted to ICASSP 202
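A minimal sketch of the mask-and-mix idea, assuming a single smooth mask and one mixing factor: blend a donor signal into a "good" pronunciation over a masked region of the waveform. The Hann-shaped mask, region boundaries, and function names are illustrative; the actual SpeechBlender pipeline defines several mask varieties and mixing strategies.

```python
# Linearly interpolate two raw waveforms inside a masked region of a phonetic
# unit, with a smooth (Hann-shaped) mask so the blend has no hard edges, in
# contrast to hard Cut/Paste splicing.
import numpy as np

def blend(good: np.ndarray, donor: np.ndarray, start: int, end: int,
          alpha: float = 0.5) -> np.ndarray:
    """Mix `donor` into `good` over samples [start, end) with factor alpha."""
    assert good.shape == donor.shape
    mask = np.zeros_like(good)
    mask[start:end] = np.hanning(end - start)  # smooth rise and fall
    # Inside the mask: (1 - alpha*m) * good + alpha*m * donor; outside: good.
    return (1.0 - alpha * mask) * good + alpha * mask * donor

sr = 16000
t = np.arange(sr) / sr
good = np.sin(2 * np.pi * 220 * t)   # placeholder "good" phone region
donor = np.sin(2 * np.pi * 180 * t)  # placeholder donor signal
augmented = blend(good, donor, start=4000, end=8000, alpha=0.6)
print(augmented.shape)
```

The tapered mask is what distinguishes this from Cut/Paste: the donor's contribution fades in and out, so no splicing discontinuity is introduced at the region boundaries.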